Abstract


This paper presents an approach to model selection for regularized least-squares on
reproducing kernel Hilbert spaces in the semi-supervised setting. The role of the effective
dimension was recently shown to be crucial in the definition of a rule for the choice
of the regularization parameter, attaining asymptotically optimal performance in a
minimax sense. The main goal of the present paper is to show how the effective
dimension can be replaced by an empirical counterpart while preserving optimality.
The empirical effective dimension can be computed from independent unlabelled
samples. This makes the approach particularly appealing in the semi-supervised
setting.
1 Introduction


The semi-supervised setting in statistical learning theory has been investigated in
various recent papers [3], [2], [19]. Interest in this problem is motivated especially
by the large variety of applications where a large amount of unlabelled data is
available, but for which the process of labelling may be expensive or impractical.
The practitioner is then faced with the problem of exploiting all the available
information on the phenomenon encoded in the unlabelled data. Traditionally,
statistical learning theory has mostly studied the learning process in the so-called
supervised setting [17], [12], [7], [6], [5], [14], where a set of input-output couples
is given. It is clear that unlabelled data give some extra information regarding
the marginal probability distribution on the input space. A natural starting point
for a theoretically founded approach to semi-supervised learning is the analysis of
the optimal rates achievable when the marginal distribution is known. This was
the main goal of [4] and [8], where a criterion for the choice of the regularization
parameter for regularized least-squares (RLS) on reproducing kernel Hilbert spaces
(RKHS) was given, leading to optimal rates in a marginal-dependent minimax
sense. In that case the optimal regularization parameter was expressed in terms of
the effective dimension, the trace of a certain operator defined by the kernel and
the marginal distribution itself.

This  paper  considers  the  following  natural  step  in  the  analysis  of  semi-supervised 
learning:  exploiting  unlabelled  data  in  order  to  replace  effective  dimension  with  an 
empirical  version  of  it  while  conserving  asymptotically  optimal  performances. 

The  plan  of  the  paper  is  as  follows.  In  section  2  we  briefly  recall  the  main  concepts 
of  statistical  learning  and  define  the  RLS  algorithm  on  RKHS.  We  also  overview  the 
main  result  of  [4]  giving  a  rule  for  the  optimal  choice  of  the  regularization  parameter 
in  terms  of  the  effective  dimension.  In  section  3  we  define  the  empirical  counterpart 
of  effective  dimension.  This  can  be  expressed  quite  naturally  by  the  empirical  kernel 
matrix  associated  to  a  set  of  independent  unlabelled  data.  The  main  result  of  this 
section  is  a  concentration  result  relating  empirical  effective  dimension  and  effective 
dimension.  Finally  in  section  4  we  generalize  the  main  theorem  of  [4]  to  the  empirical 
case,  prove  asymptotic  optimality,  and  present  a  sketch  of  an  explicit  procedure 
that  can  achieve  optimal  rates  when  enough  independent  unlabelled  samples  are 
available.  Let  us  stress  that  the  procedures  presented  here  have  not  been  designed 
to  be  computationally  effective  but  rather  to  be  simple  and  instructive.  In  fact  the 
aim  of  this  analysis  is  focusing  on  the  theoretical  issues  that  should  be  considered 
while  developing  model  selection  techniques  in  the  semi-supervised  setting. 
2 Learning Theory


We consider a compact input space $X \subset \mathbb{R}^n$ and an output space
$Y = [-M, M] \subset \mathbb{R}$. The space $Z = X \times Y$ is endowed with a probability
measure $\rho(x, y) = \rho_X(x)\rho(y|x)$, where $\rho_X(x)$ denotes the marginal probability
measure on $X$ and $\rho(y|x)$ the conditional probability measure of $y$ given $x$. The
probability measure $\rho$ is fixed but unknown. The data we are given is a training set
of $\ell$ pairs of examples $\mathbf{z} = (\mathbf{x}, \mathbf{y}) = \{(x_i, y_i)\}_{i=1}^{\ell}$ drawn i.i.d. with
respect to $\rho$, that is $\mathbf{z} \in Z^{\ell}$. Roughly speaking, the goal of learning is to
design a procedure that, given the training set $\mathbf{z}$, provides us with a function
$f_{\mathbf{z}}$ that will correctly estimate the label $y$ given new points $x$, i.e. we want
$f_{\mathbf{z}}$ to generalize. In this paper we analyze the regularized least-squares (RLS)
algorithm when the hypothesis space is chosen to be a reproducing kernel Hilbert
space (RKHS). For any given $\lambda > 0$ and training set $\mathbf{z}$, the RLS algorithm defines
an estimator $f_{\mathbf{z}}^{\lambda}$ as the solution of the following minimization problem


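\[
\min_{f \in \mathcal{H}} \Big\{ \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \Big\}, \tag{1}
\]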


where $\|\cdot\|_{\mathcal{H}}$ is the norm in the RKHS $\mathcal{H}$ [1]. Roughly speaking, the first
term measures how well the estimator $f$ fits the data, whereas the second term is a
penalization which constrains the complexity of the solution. The parameter $\lambda$
balances the two terms. The intuition behind the algorithm is that the regularization
parameter $\lambda$ allows us to pass from overfitting to oversmoothing, so that a good choice
of the regularization parameter on the basis of the given data, $\lambda = \lambda(\mathbf{z}, \ell)$, allows
us to prevent both. In this sense we can think of the regularization parameter choice as
a model selection procedure*. The question is then how to choose $\lambda$ in order to
obtain good generalization properties. To formalize the problem we can consider
the squared loss function $(y - f(x))^2$ and introduce the expected loss


\[
I[f] = \int_{X \times Y} (y - f(x))^2 \, d\rho(x, y).
\]

We assume that the above functional admits a minimizer on $\mathcal{H}$ that we denote by
$f_{\mathcal{H}}$. If $\mathcal{H}$ is dense** in the space of square integrable functions with respect
to $\rho$, then $f_{\mathcal{H}}$ is the regression function

\[
f_{\rho}(x) = \int_{Y} y \, d\rho(y|x).
\]

In this case the problem is to find an estimator whose error is close to that of $f_{\mathcal{H}}$.
In this paper we study a consistency property of RLS; in fact we want to define a
choice $\lambda_{\ell} = \lambda(\ell)$ such that for every $\varepsilon > 0$

\[
\lim_{\ell \to \infty} P\big[ I[f_{\mathbf{z}}^{\lambda_{\ell}}] - I[f_{\rho}] > \varepsilon \big] = 0.
\]
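
As a concrete illustration of the estimator defined by (1): for a kernel $K$ the
representer theorem gives the minimizer in closed form, $f = \sum_i c_i K(\cdot, x_i)$
with $(\mathbf{G} + \ell\lambda I)c = y$ and $\mathbf{G}_{ij} = K(x_i, x_j)$, provided the empirical
risk in (1) carries the factor $1/\ell$. The following Python sketch assumes NumPy and a
Gaussian kernel; the names gaussian_kernel, rls_fit, rls_predict, rkhs_norm_sq, the
bandwidth and the toy data are illustrative choices and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    """Gram matrix K(a_i, b_j) of a Gaussian kernel (an illustrative choice)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def rls_fit(x, y, lam, kernel=gaussian_kernel):
    """Coefficients c of the RLS estimator: solve (G + ell * lam * I) c = y."""
    ell = x.shape[0]
    gram = kernel(x, x)
    return np.linalg.solve(gram + ell * lam * np.eye(ell), y)

def rls_predict(x_train, coef, x_new, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i c_i K(x, x_i) at the rows of x_new."""
    return kernel(x_new, x_train) @ coef

def rkhs_norm_sq(x, coef, kernel=gaussian_kernel):
    """Squared RKHS norm of the estimator, c^T G c."""
    return float(coef @ kernel(x, x) @ coef)

# Toy data: a noiseless sine on a regular grid (no randomness involved).
x_train = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y_train = np.sin(2.0 * np.pi * x_train[:, 0])
coef_small = rls_fit(x_train, y_train, lam=1e-3)   # weak regularization
coef_large = rls_fit(x_train, y_train, lam=1.0)    # strong regularization
```

Increasing $\lambda$ shrinks the RKHS norm of the solution, which is the oversmoothing end
of the trade-off discussed above; decreasing it moves towards interpolation of the data.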


* Though it might happen that no explicit structure of models (spaces) is considered.

** This is the case for universal kernels (for example, for the Gaussian kernel) and we
refer to [15] for details on the subject.
While studying consistency the crucial issue is indeed the convergence rate. In fact,
this gives information on the finite sample behavior of the considered algorithm.
Unfortunately there are classic results [11] showing that convergence rates are
obtainable only under suitable assumptions on the probability distribution underlying
the learning problem. Hence, as we look for convergence rates, we first have
to specify the class of probability distributions $\mathcal{M}$ to which we restrict our analysis.
The natural question arising, before starting the consistency analysis of a given
algorithm, is that of the best attainable rates under the prior assumption
$\rho \in \mathcal{M}$. The answer to this question is then given in terms of minimax optimality
results, that is, studying lower bounds of the quantity

\[
\inf_{f_{\mathbf{z}}} \sup_{\rho \in \mathcal{M}} \big[ \mathbb{E}( I[f_{\mathbf{z}}] - I[f_{\mathcal{H}}] ) \big]
\]

where the infimum is taken w.r.t. all the possible learning algorithms $\mathbf{z} \mapsto f_{\mathbf{z}}$.
Usually the class $\mathcal{M}$ is characterized through some assumption on the minimizer
$f_{\mathcal{H}}$, for example smoothness or approximability properties. In [4] upper and lower
bounds for RLS are proposed in the case where $f_{\mathcal{H}}$ has approximability properties
in a RKHS.

After  recalling  some  basic  concepts  about  RKHS  we  briefly  review  the  main  results 
in  [4]  that  we  develop  in  the  following  sections. 


2.1  RKHS  and  Covariance  Operators 


We briefly recall some ideas on RKHS we use in the following (see [1] for a broader
introduction to the subject). A RKHS is a Hilbert space of functions uniquely
defined by a symmetric positive definite function $K : X \times X \to \mathbb{R}$, namely the
kernel. We say that $K$ is positive definite if for all $m > 0$, $x_1, \dots, x_m \in X$ and
$c_1, \dots, c_m \in \mathbb{R}$ the following inequality holds

\[
\sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) \ge 0.
\]

We will assume throughout that the kernel is bounded, that is

\[
\sup_{x \in X} K(x, x) \le \kappa.
\]

It will be useful to recall that the following operators are naturally defined

•  Covariance operator $T : \mathcal{H} \to \mathcal{H}$

\[
T := \int_{X} \langle \cdot, K_x \rangle_{\mathcal{H}} K_x \, d\rho_X(x).
\]

•  Empirical covariance operator $T_{\mathbf{x}} : \mathcal{H} \to \mathcal{H}$

\[
T_{\mathbf{x}} := \frac{1}{\ell} \sum_{i=1}^{\ell} \langle \cdot, K_{x_i} \rangle_{\mathcal{H}} K_{x_i},
\]

where we set $K_x = K(\cdot, x)$ and $\mathbf{x} = (x_i)_{i=1}^{\ell}$. The operators $T$ and $T_{\mathbf{x}}$ can be
proved to be positive, self-adjoint, Hilbert-Schmidt and trace-class (see, for example,
the appendix in [10]).

2.2  Optimal  a  Priori  Choice  for  Regularized  Least  Squares 

We  now  recall  the  results  in  [4]  about  RLS  algorithm  that  we  develop  in  the  following 
sections. 

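Since we look for convergence rates we first clarify the assumptions on the
probability measure $\rho$ we consider. To this aim we define the family of priors

\[
\mathcal{P}(a, R) := \{ f \in \mathcal{H} \mid T^{-a} f \in \mathcal{H} \text{ with } \|T^{-a} f\|_{\mathcal{H}} \le R \},
\]

where $a \in (0, 1/2]$ and $R > 0$. Moreover we consider the population version of the
RLS algorithm and let $f^{\lambda}$ be the solution of the problem

\[
\min_{f \in \mathcal{H}} \Big\{ \int_{X \times Y} (y - f(x))^2 \, d\rho(x, y) + \lambda \|f\|_{\mathcal{H}}^2 \Big\}.
\]

If we let $\|\cdot\|_{\rho}$ be the norm in the space of square integrable functions with respect
to $\rho$, according to [4] we can define the following quantities:

\[
\mathcal{A}(\lambda) := \|f^{\lambda} - f_{\mathcal{H}}\|_{\rho}^2, \qquad \mathcal{B}(\lambda) := \|f^{\lambda} - f_{\mathcal{H}}\|_{\mathcal{H}}^2,
\]

measuring the approximation error in the norm $\|\cdot\|_{\rho}$ and in the norm $\|\cdot\|_{\mathcal{H}}$
respectively. Moreover we define the effective dimension

\[
\mathcal{N}(\lambda) := \mathrm{Tr}[(T + \lambda)^{-1} T],
\]

which plays the role of a capacity measure for the RLS algorithm. When
$f_{\mathcal{H}} \in \mathcal{P}(a, R)$ the following inequalities hold

\[
\mathcal{A}(\lambda) \le \lambda^{c} \|T^{-a} f_{\mathcal{H}}\|_{\mathcal{H}}^2, \qquad \mathcal{B}(\lambda) \le \lambda^{c-1} \|T^{-a} f_{\mathcal{H}}\|_{\mathcal{H}}^2, \tag{2}
\]

where $c = 2a + 1$ (see Lemma 6 in [4]). Moreover if the eigenvalues $(t_n)_{n=1}^{\infty}$ of the
operator $T$ fulfill $t_n = O(n^{-b})$ for some $b > 0$ then

\[
\mathcal{N}(\lambda) = O(\lambda^{-1/b}), \tag{3}
\]

see again Lemma 6 in [4]. Finally we recall that the stochastic order symbol is
defined by the following equivalence [16]

\[
X_{\ell} = O_P(h_{\ell}) \iff \lim_{D \to \infty} \limsup_{\ell \to \infty} P\big[ |X_{\ell}| > D h_{\ell} \big] = 0.
\]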
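
To make the decay bound (3) concrete, one can evaluate
$\mathcal{N}(\lambda) = \sum_n t_n/(t_n + \lambda)$ directly for a model spectrum. The sketch below
assumes exact power-law eigenvalues $t_n = n^{-b}$ with $b = 2$ and truncates the series;
both choices and the function name effective_dimension are illustrative assumptions.
For this spectrum, comparing the sum with $\int_0^{\infty} du/(1 + \lambda u^2)$ gives
$\mathcal{N}(\lambda) \le (\pi/2)\lambda^{-1/2}$.

```python
import numpy as np

def effective_dimension(lam, b=2.0, n_terms=10**6):
    """Truncated sum of t_n / (t_n + lam) for the model spectrum t_n = n**(-b)."""
    t = np.arange(1, n_terms + 1, dtype=float) ** (-b)
    return float(np.sum(t / (t + lam)))

lams = [1e-1, 1e-2, 1e-3, 1e-4]
dims = [effective_dimension(lam) for lam in lams]
# Rescaled values N(lam) * lam**(1/b); they should stay bounded as lam -> 0.
scaled = [n * lam ** 0.5 for n, lam in zip(dims, lams)]
```

The rescaled values remain below $\pi/2$ while $\mathcal{N}(\lambda)$ itself grows as $\lambda$
decreases, which is the behavior expressed by (3) for this particular spectrum.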

The  following  theorem  summarizes  the  main  result  in  [4]. 

Theorem 1 Let $\mathbf{z}$ be a training set drawn i.i.d. according to $\rho$ and $f_{\mathbf{z}}^{\lambda}$ the RLS
estimator.
(1) Let $0 < \eta < 1$. If


t  max{AA(A),  ^2/Cv} 

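then with probability greater than $1 - \eta$,

\[
I[f_{\mathbf{z}}^{\lambda}] - I[f_{\mathcal{H}}] \le C_{\eta} \Big( \mathcal{A}(\lambda) + \frac{\kappa^2 \mathcal{B}(\lambda)}{\ell^2 \lambda} + \frac{\kappa \mathcal{A}(\lambda)}{\ell \lambda} + \frac{\kappa M}{\ell^2 \lambda} + \frac{M \mathcal{N}(\lambda)}{\ell} \Big)
\]

where $C_{\eta} = 128 \log^2(8/\eta)$.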

(2) Assume $f_{\rho} \in \mathcal{P}(a, R)$, $a \in (0, 1/2]$ and that the eigenvalues $(t_n)_{n=1}^{\infty}$ of the
operator $T$ fulfill $t_n = O(n^{-b})$ for some $b > 0$. If we choose the unique value
$\lambda_0 = \lambda_0(\ell)$ such that

\[
\mathcal{N}(\lambda_0) = \ell \lambda_0^{c}
\]

then

\[
I[f_{\mathbf{z}}^{\lambda_0}] - I[f_{\mathcal{H}}] = O_P\big( \ell^{-\frac{bc}{bc+1}} \big) \tag{4}
\]

where $c = 2a + 1$.

Remark 1 The rate in (4) can be shown to be optimal in a minimax sense with
respect to the considered class of probability distributions [4].

Remark 2 If the hypothesis space $\mathcal{H}$ is finite dimensional then the convergence
rate is $\ell^{-1}$.
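
The balance equation defining $\lambda_0$ in Item (2) can be solved numerically, since
$\mathcal{N}(\lambda)$ is decreasing and $\ell\lambda^{c}$ is increasing in $\lambda$. The sketch below uses
bisection on a logarithmic scale under the illustrative assumption $t_n = n^{-b}$
(truncated series) with $b = 2$ and $c = 2$; the helper names and the bracket
$[10^{-12}, 10^{3}]$ are hypothetical choices. Under these assumptions the exponent in
(4) is $-bc/(bc+1) = -4/5$.

```python
import numpy as np

def effective_dimension(lam, b=2.0, n_terms=10**5):
    """Truncated sum of t_n / (t_n + lam) for the model spectrum t_n = n**(-b)."""
    t = np.arange(1, n_terms + 1, dtype=float) ** (-b)
    return float(np.sum(t / (t + lam)))

def solve_lambda0(ell, c, lo=1e-12, hi=1e3, iters=100):
    """Bisection (in log scale) for the root of N(lam) = ell * lam**c."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_dimension(mid) > ell * mid ** c:
            lo = mid   # N still dominates: the root lies to the right
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

c = 2.0
lam_1e3 = solve_lambda0(1e3, c)
lam_1e5 = solve_lambda0(1e5, c)
# Empirical exponent of lam0**c as a function of ell (compare with -bc/(bc+1)).
slope = c * np.log(lam_1e5 / lam_1e3) / np.log(1e5 / 1e3)
```

The quantity slope estimates the exponent of $\lambda_0^{c}$ between two sample sizes; for
finite $\ell$ it is only approximately equal to the asymptotic value.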


3  Empirical  Effective  Dimension 


In  this  section  we  show  that  effective  dimension  can  be  empirically  estimated  from 
a  set  of  unlabelled  data. 

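Definition 1 Let $\mathbf{x} = (x_i)_{i=1}^{m}$ be a set of $m$ inputs. We define the empirical effective
dimension as

\[
\mathcal{N}_{\mathbf{x}}(\lambda) := \mathrm{Tr}[(T_{\mathbf{x}} + \lambda)^{-1} T_{\mathbf{x}}].
\]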
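
In practice the empirical effective dimension can be computed from the kernel matrix:
$T_{\mathbf{x}}$ has the same nonzero eigenvalues as the $m \times m$ matrix with entries
$K(x_i, x_j)/m$. The following NumPy sketch relies on this fact; the function name and the
linear-kernel check are illustrative assumptions.

```python
import numpy as np

def empirical_effective_dimension(gram, lam):
    """Tr[(T_x + lam)^{-1} T_x], computed from gram[i, j] = K(x_i, x_j)."""
    m = gram.shape[0]
    eigvals = np.linalg.eigvalsh(gram / m)     # spectrum of T_x (plus zeros)
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    return float(np.sum(eigvals / (eigvals + lam)))

# Check with the linear kernel K(x, x') = <x, x'> on R^2, where T_x is the
# 2-by-2 second-moment matrix and the trace can be computed directly.
points = np.arange(12, dtype=float).reshape(6, 2) / 10.0
lam = 0.1
from_gram = empirical_effective_dimension(points @ points.T, lam)
second_moment = points.T @ points / points.shape[0]
direct = float(np.trace(np.linalg.solve(second_moment + lam * np.eye(2), second_moment)))
```

The eigenvalue decomposition costs of the order of $m^3$ operations, in line with the
remark that the procedures discussed here are not designed to be computationally
effective.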


The  main  result  of  this  section  is  the  following  concentration  result  relating  the 
effective  dimension  to  the  empirical  effective  dimension. 

Theorem 2 Let $\lambda > 0$, $m \in \mathbb{N}$ and $\mathbf{x} = (x_i)_{i=1}^{m}$ a set of $m$ input values drawn i.i.d.
according to $\rho_X$. Let $0 < \eta < 1$, $\Delta > 0$. If

m>r(A,,,A):=2^(l+/)l%. 

the following inequality holds with probability $1 - \eta$

\[
|\mathcal{N}(\lambda) - \mathcal{N}_{\mathbf{x}}(\lambda)| \le \Delta.
\]
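
A small simulation illustrates the statement of Theorem 2 in a case where
$\mathcal{N}(\lambda)$ is known exactly. The sketch assumes the linear kernel on $\mathbb{R}^d$ with
inputs uniform on $[-1, 1]^d$, so that $T = I/3$, $\kappa = d$ and
$\mathcal{N}(\lambda) = d/(1 + 3\lambda)$; these modelling choices, the seed and the sample sizes
are illustrative and not taken from the paper.

```python
import numpy as np

def effective_dimension_linear(second_moment, lam):
    """Tr[(T + lam)^{-1} T] when T is represented by a d-by-d matrix."""
    eigvals = np.clip(np.linalg.eigvalsh(second_moment), 0.0, None)
    return float(np.sum(eigvals / (eigvals + lam)))

d, lam = 5, 0.1
exact = d / (1.0 + 3.0 * lam)              # T = I/3 for uniform inputs on [-1, 1]^d
rng = np.random.default_rng(0)
errors = {}
for m in (100, 1_000, 10_000, 100_000):
    x = rng.uniform(-1.0, 1.0, size=(m, d))
    t_x = x.T @ x / m                      # empirical covariance operator T_x
    errors[m] = abs(effective_dimension_linear(t_x, lam) - exact)
```

The dictionary errors records $|\mathcal{N}(\lambda) - \mathcal{N}_{\mathbf{x}}(\lambda)|$ for increasing
$m$; the deviation is expected to shrink roughly like $m^{-1/2}$, although individual
draws fluctuate.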
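3.1 Proof


To prove the above theorem we need the following probabilistic inequality for
random variables in Hilbert spaces due to [13]. We use in particular the following simple
restatement of Th. 3.3.4 of [18], whose proof can be found in [4].

Lemma 3 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\xi$ be a random variable on $\Omega$
taking values in a real separable Hilbert space $\mathcal{K}$. Assume that there are two positive
constants $H$ and $\sigma$ such that

\[
\|\xi(\omega)\|_{\mathcal{K}} \le \frac{H}{2} \quad \text{a.s.}, \tag{5}
\]

\[
\mathbb{E}[\|\xi\|_{\mathcal{K}}^2] \le \sigma^2. \tag{6}
\]

Let $\ell \in \mathbb{N}$ and $0 < \eta < 1$, then

\[
P\Big[ (\omega_1, \dots, \omega_{\ell}) \in \Omega^{\ell} \;\Big|\; \Big\| \frac{1}{\ell} \sum_{i=1}^{\ell} \xi(\omega_i) - \mathbb{E}[\xi] \Big\|_{\mathcal{K}} \le 2 \Big( \frac{H}{\ell} + \frac{\sigma}{\sqrt{\ell}} \Big) \log\frac{2}{\eta} \Big] \ge 1 - \eta. \tag{7}
\]

We can now prove Theorem 2.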


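PROOF. We first claim that

\[
|\mathcal{N}(\lambda) - \mathcal{N}_{\mathbf{x}}(\lambda)| = |\mathrm{Tr}((T + \lambda)^{-1} T - (T_{\mathbf{x}} + \lambda)^{-1} T_{\mathbf{x}})| \le \Delta\mathcal{N}_1(\mathbf{x}) + \Delta\mathcal{N}_2(\mathbf{x}), \tag{8}
\]

where

\[
\Delta\mathcal{N}_1(\mathbf{x}) = \frac{1}{\lambda} \mathrm{Tr}(T - T_{\mathbf{x}}) \quad \text{and} \quad \Delta\mathcal{N}_2(\mathbf{x}) = \frac{\kappa}{\lambda^2} \|T - T_{\mathbf{x}}\|,
\]

where $\|\cdot\|$ is the norm in the Banach space of bounded linear operators from $\mathcal{H}$ to
$\mathcal{H}$. We start by considering the following simple algebraic equalities

\[
\begin{aligned}
(T + \lambda)^{-1} T - (T_{\mathbf{x}} + \lambda)^{-1} T_{\mathbf{x}}
&= (T + \lambda)^{-1}(T - T_{\mathbf{x}}) + [(T + \lambda)^{-1} - (T_{\mathbf{x}} + \lambda)^{-1}] T_{\mathbf{x}} \\
&= (T + \lambda)^{-1}(T - T_{\mathbf{x}}) + (T + \lambda)^{-1}(T_{\mathbf{x}} - T)(T_{\mathbf{x}} + \lambda)^{-1} T_{\mathbf{x}}.
\end{aligned} \tag{9}
\]

Recalling that (see [9] for a proof)

\[
\|(T + \lambda)^{-1}\| \le \frac{1}{\lambda}, \qquad \|(T_{\mathbf{x}} + \lambda)^{-1}\| \le \frac{1}{\lambda},
\]

we have

\[
\mathrm{Tr}((T + \lambda)^{-1}(T - T_{\mathbf{x}})) \le \frac{1}{\lambda} \mathrm{Tr}(T - T_{\mathbf{x}}) \tag{10}
\]

where we used the fact that $\mathrm{Tr}(AB) \le \|A\| \mathrm{Tr}(B)$ if $A$ and $B$ are self-adjoint.
Moreover the definition of $T_{\mathbf{x}}$ and the Schwarz inequality imply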
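\[
\mathrm{Tr}((T + \lambda)^{-1}(T_{\mathbf{x}} - T)(T_{\mathbf{x}} + \lambda)^{-1} T_{\mathbf{x}})
= \frac{1}{m} \sum_{i=1}^{m} \langle (T + \lambda)^{-1}(T_{\mathbf{x}} - T)(T_{\mathbf{x}} + \lambda)^{-1} K_{x_i}, K_{x_i} \rangle_{\mathcal{H}}
\le \frac{\kappa}{\lambda^2} \|T - T_{\mathbf{x}}\|, \tag{11}
\]

and (8) follows by taking the trace of (9) and plugging in Inequalities (10) and
(11).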
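
The algebraic identity (9) and the resolvent bound $\|(T + \lambda)^{-1}\| \le 1/\lambda$ hold for
any pair of positive semi-definite matrices, so they can be checked numerically. The
sketch below uses small random matrices as stand-ins for $T$ and $T_{\mathbf{x}}$; the function
names, the dimension and the seed are illustrative assumptions.

```python
import numpy as np

def random_psd(rng, dim):
    """A random symmetric positive semi-definite matrix."""
    a = rng.standard_normal((dim, dim))
    return a @ a.T / dim

def identity_gap(t, t_x, lam):
    """Largest entrywise difference between the two sides of identity (9)."""
    eye = np.eye(t.shape[0])
    r = np.linalg.inv(t + lam * eye)        # (T + lam)^{-1}
    r_x = np.linalg.inv(t_x + lam * eye)    # (T_x + lam)^{-1}
    lhs = r @ t - r_x @ t_x
    rhs = r @ (t - t_x) + r @ (t_x - t) @ r_x @ t_x
    return float(np.max(np.abs(lhs - rhs)))

rng = np.random.default_rng(1)
t, t_x, lam = random_psd(rng, 6), random_psd(rng, 6), 0.05
gap = identity_gap(t, t_x, lam)
# Spectral norm of the resolvent; it cannot exceed 1 / lam for a PSD matrix.
resolvent_norm = np.linalg.norm(np.linalg.inv(t + lam * np.eye(6)), 2)
```

The value gap is zero up to floating-point round-off, and resolvent_norm stays below
$1/\lambda$ because the smallest eigenvalue of $T + \lambda$ is at least $\lambda$.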

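To finish the proof we need to give probabilistic bounds on $\Delta\mathcal{N}_1(\mathbf{x})$ and
$\Delta\mathcal{N}_2(\mathbf{x})$; to this purpose we are going to use Lemma 3. We first give the bound
on $\Delta\mathcal{N}_1(\mathbf{x})$. Let us consider the random variable $\xi_1 : X \to \mathbb{R}$

\[
\xi_1(x) := \mathrm{Tr}(\langle \cdot, K_x \rangle_{\mathcal{H}} K_x).
\]

It is straightforward to check that

\[
|\xi_1(x)| \le \kappa
\]

and moreover

\[
\frac{1}{m} \sum_{i=1}^{m} \xi_1(x_i) = \mathrm{Tr}(T_{\mathbf{x}}), \qquad \mathbb{E}[\xi_1] = \mathrm{Tr}(T).
\]

We can then apply Lemma 3 with $H = \sigma = \kappa$ to get with probability greater than
$1 - \eta$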

AiVi(x)  <  «ln  -  ( —  H — E  )  • 
r]  \m 

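Finally we study the bound on $\Delta\mathcal{N}_2(\mathbf{x})$. Recall that both $T$ and $T_{\mathbf{x}}$ are
Hilbert-Schmidt operators so that we can apply Lemma 3. If we let $\mathcal{L}_2(\mathcal{H})$ be the
space of Hilbert-Schmidt operators from $\mathcal{H}$ to $\mathcal{H}$ and we denote with
$\|\cdot\|_{\mathcal{L}_2(\mathcal{H})}$ the Hilbert-Schmidt norm, the uniform norm is dominated by the norm
$\|\cdot\|_{\mathcal{L}_2(\mathcal{H})}$. Then we can introduce the random variable
$\xi_2 : X \to \mathcal{L}_2(\mathcal{H})$

\[
\xi_2(x) := \langle \cdot, K_x \rangle_{\mathcal{H}} K_x.
\]

Again it is easy to check that

\[
\|\xi_2\|_{\mathcal{L}_2(\mathcal{H})} \le \kappa
\]

and moreover

\[
\frac{1}{m} \sum_{i=1}^{m} \xi_2(x_i) = T_{\mathbf{x}}, \qquad \mathbb{E}[\xi_2] = T.
\]

Applying Lemma 3 with $H = \sigma = \kappa$ we get with probability greater than $1 - \eta$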

AA^x)  <  Kin  -  ( - 1 — =  )  , 

77  \  m  y  m J 


and  the  theorem  is  proved. 


4  Optimal  parameter  choice  in  Semi-supervised  Setting 


In Theorem 1, to define an optimal a priori choice for the regularization parameter
for a given prior we need to know the effective dimension $\mathcal{N}(\lambda)$. In a semi-supervised
setting we can use the concentration result of the previous section to replace
$\mathcal{N}(\lambda)$ with an empirical estimate based on unlabelled data. The goal of this section
is to show that using such an estimate we can define a data-dependent parameter choice
achieving the optimal convergence rate.


4.1 Main Result


Recall that if we know $\mathcal{N}(\lambda)$ we can choose the value $\lambda_0$ according to Theorem
1 to achieve the optimal rate. The idea behind our parameter choice is to replace
$\mathcal{N}(\lambda)$ with an approximation based on unlabelled data. Roughly speaking, to ensure
that the parameter choice based on unlabelled data is still optimal we have to
suitably control the quality of the empirical estimate for $\mathcal{N}(\lambda)$. To clarify this we
let $0 < a_- < 1 < a_+$ be two fixed constants and define the values $\lambda_{\ell}^{+}$ and
$\lambda_{\ell}^{-}$ such that

\[
a_+ \mathcal{N}(\lambda_{\ell}^{+}) = \ell (\lambda_{\ell}^{+})^{c} \quad \text{and} \quad a_- \mathcal{N}(\lambda_{\ell}^{-}) = \ell (\lambda_{\ell}^{-})^{c}. \tag{12}
\]

It is possible to show that if we choose either $\lambda_{\ell}^{+}$ or $\lambda_{\ell}^{-}$ we get the same
convergence rate as choosing $\lambda_0$. Intuitively we want our estimates for $\mathcal{N}(\lambda)$ to
lie, with high probability, between $a_+ \mathcal{N}(\lambda)$ and $a_- \mathcal{N}(\lambda)$ for each value of
$\lambda$. In this case we expect to be able to select $\lambda$ so that the good asymptotic
properties are maintained.

We now formalize the above idea. The first step toward the definition of our
parameter choice rule is to consider a suitable discretization criterion for $\lambda$. This is
most reasonable from a practical point of view and will not prevent us from obtaining
optimal convergence results. The following assumption describes the discretization
procedure that we are going to consider.

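Assumption 1 We discretize the possible values for the regularization parameter
considering $0 < \lambda_k < \lambda_{k-1}$ with $k = 1, 2, \dots$ such that

\[
\lambda_k \ge q \lambda_{k-1}. \tag{13}
\]

The following assumption describes the regularization parameter choice we consider.

Assumption 2 We assume the index $k(\ell) \in \mathbb{N}$ be such that if we let
$\hat\lambda_{\ell}^{+} := \lambda_{k(\ell)}$ and $\hat\lambda_{\ell}^{-} := \lambda_{k(\ell)+1}$ then the following conditions hold true

\[
a_- \mathcal{N}(\hat\lambda_{\ell}^{+}) \le \ell (\hat\lambda_{\ell}^{+})^{c} \quad \text{and} \quad a_+ \mathcal{N}(\hat\lambda_{\ell}^{-}) \ge \ell (\hat\lambda_{\ell}^{-})^{c}. \tag{14}
\]

In Section 4.3 we show how to actually find $\hat\lambda_{\ell}^{+}$ in an iterative way. The next
theorem shows that choosing $\lambda = \hat\lambda_{\ell}^{+}$ we can actually achieve the same rate as the
optimal value $\lambda_0$.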
Theorem 4 Under the same hypotheses of Theorem 1 Item 2, if Assumption 1
holds and the random variables $(k(\ell))_{\ell \in \mathbb{N}}$, with values in $\mathbb{N}$, fulfill Assumption 2
with probability greater than $1 - \bar\eta(\ell)$ for some $\bar\eta(\ell) \to 0$, then defining
$\hat\lambda_{\ell} := \hat\lambda_{\ell}^{+} := \lambda_{k(\ell)}$ one has

\[
I[f_{\mathbf{z}}^{\hat\lambda_{\ell}}] - I[f_{\mathcal{H}}] = O_P\big( \ell^{-\frac{bc}{bc+1}} \big) \tag{15}
\]

where $c = 2a + 1$.


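4.2 Proof

Before giving the proof of Theorem 4 we collect a few simple results on our
parameter choice in the following lemma.

Lemma 5 Let $\hat\lambda_{\ell}^{+}$, $\hat\lambda_{\ell}^{-}$ as in Assumption 2 and $\lambda_{\ell}^{+}$, $\lambda_{\ell}^{-}$ as in (12). Then

(1) the following inequalities hold

\[
\hat\lambda_{\ell}^{+} \ge \lambda_{\ell}^{-} \quad \text{and} \quad \hat\lambda_{\ell}^{-} \le \lambda_{\ell}^{+}. \tag{16}
\]

(2) $\hat\lambda_{\ell}^{+} \to 0$ as $\ell \to \infty$.

(3) The following inequality holds true

\[
(\hat\lambda_{\ell}^{+})^{c} + \frac{\mathcal{N}(\hat\lambda_{\ell}^{+})}{\ell} \le \big( q^{c} + (a_-)^{-1} \big) (\lambda_{\ell}^{+})^{c}. \tag{17}
\]


PROOF. We first prove Item 1 by contradiction. If we let $\hat\lambda_{\ell}^{+} < \lambda_{\ell}^{-}$ then

\[
\mathcal{N}(\hat\lambda_{\ell}^{+}) \ge \mathcal{N}(\lambda_{\ell}^{-})
\]

so that by Assumption 2

\[
a_- \mathcal{N}(\lambda_{\ell}^{-}) \le \ell (\hat\lambda_{\ell}^{+})^{c} < \ell (\lambda_{\ell}^{-})^{c}
\]

which is impossible because of (12). Similarly one can prove that
$\hat\lambda_{\ell}^{-} \le \lambda_{\ell}^{+}$. The proof of Item 2 follows from Item 1 and Assumption 1. In fact
we have

\[
\hat\lambda_{\ell}^{+} \le q \hat\lambda_{\ell}^{-} \le q \lambda_{\ell}^{+}
\]

and the proof follows since $\lambda_{\ell}^{+} \to 0$ as $\ell \to \infty$. Finally from Assumption 1 and Item
1 we have

\[
(\hat\lambda_{\ell}^{+})^{c} \le q^{c} (\hat\lambda_{\ell}^{-})^{c} \le q^{c} (\lambda_{\ell}^{+})^{c}
\]

and from Item 1 and (12)

\[
\frac{\mathcal{N}(\hat\lambda_{\ell}^{+})}{\ell} \le \frac{\mathcal{N}(\lambda_{\ell}^{-})}{\ell} = (a_-)^{-1} (\lambda_{\ell}^{-})^{c} \le (a_-)^{-1} (\lambda_{\ell}^{+})^{c}.
\]

Putting the above inequalities together we get (17).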

We are now ready to prove Theorem 4.


PROOF. [Theorem 4] The proof is similar to that of Theorem 1 Item 2 (see [4]).
We assume that $k(\ell)$ is a random variable fulfilling Assumption 2 with confidence
level $1 - \bar\eta(\ell)$ where $\bar\eta(\ell) \to 0$ as $\ell \to \infty$. Recall that $c > 1$; if we let
$0 < \eta < 1$ then with confidence level at least $1 - \bar\eta(\ell)$

\[
\ell \hat\lambda_{\ell}^{+} = \ell (\hat\lambda_{\ell}^{+})^{1-c} (\hat\lambda_{\ell}^{+})^{c} \ge (\hat\lambda_{\ell}^{+})^{1-c} a_- \mathcal{N}(\hat\lambda_{\ell}^{+}).
\]

Since $1 - c < 0$, from Lemma 5 Item 2 we know that there exists $\ell(\eta) \in \mathbb{N}$ such that


£  <  nrax{A7(A+),  \J2/CV} 


<  rj(£) 


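for $\ell \ge \ell(\eta)$. If we now choose the value $\hat\lambda_{\ell} = \hat\lambda_{\ell}^{+}$ then we have

\[
P\Big[ X_{\ell} > C_{\eta} \Big( \mathcal{A}(\hat\lambda_{\ell}) + \frac{\kappa^2 \mathcal{B}(\hat\lambda_{\ell})}{\ell^2 \hat\lambda_{\ell}} + \frac{\kappa \mathcal{A}(\hat\lambda_{\ell})}{\ell \hat\lambda_{\ell}} + \frac{\kappa M}{\ell^2 \hat\lambda_{\ell}} + \frac{M \mathcal{N}(\hat\lambda_{\ell})}{\ell} \Big) \Big] \le \eta
\]

where $C_{\eta} = 128 \log^2(8/\eta)$ and $X_{\ell} = I[f_{\mathbf{z}}^{\hat\lambda_{\ell}}] - I[f_{\mathcal{H}}]$. Using Lemma 6 in [4] we
can simplify the form of the above bound. In fact it is easy to show (see the proof of
Theorem 1 in [4]) that asymptotically the first and the last term in the bound
prevail, so that a positive constant $C'$ and a natural number $\ell'(\eta)$ exist for which

\[
P\Big[ X_{\ell} > C' C_{\eta} \Big( \mathcal{A}(\hat\lambda_{\ell}) + \frac{\mathcal{N}(\hat\lambda_{\ell})}{\ell} \Big) \Big] \le \eta, \qquad \forall \ell \ge \ell'(\eta).
\]

If we now apply (2) and Lemma 5 Item 3, we can rewrite the above bound using the
stochastic order symbol [16]. In fact a positive constant $C''$ exists for which

\[
P\big[ X_{\ell} > D (\lambda_{\ell}^{+})^{c} \big] \le 8 e^{-\sqrt{D / (128 C'')}} + \bar\eta(\ell),
\]

if $D \ge 128\, C' \big( q^{c} + (a_-)^{-1} \big) \log^2 8$ and $\ell \ge \ell'(D)$; then we have

\[
I[f_{\mathbf{z}}^{\hat\lambda_{\ell}}] - I[f_{\mathcal{H}}] = O_P\big( (\lambda_{\ell}^{+})^{c} \big).
\]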


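Recalling (see again Lemma 6 in [4]) that if the eigenvalues of $T$ satisfy $t_n = O(n^{-b})$
then $\mathcal{N}(\lambda) = O(\lambda^{-1/b})$, from the definition of $\lambda_{\ell}^{+}$ we have

\[
\ell (\lambda_{\ell}^{+})^{c} = O\big( (\lambda_{\ell}^{+})^{-1/b} \big)
\]

which implies $(\lambda_{\ell}^{+})^{c} = O\big( \ell^{-\frac{bc}{bc+1}} \big)$ and the theorem is proved.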
4.3 Model Selection from Unlabelled Data


In this section we present an explicit procedure to find from unlabelled data the
index $k(\ell)$ satisfying Assumption 2 with a given confidence level $1 - \bar\eta(\ell)$. The
corresponding regularization parameter choice $\hat\lambda_{\ell} := \hat\lambda_{\ell}^{+} := \lambda_{k(\ell)}$ plays the central
role in Theorem 4. The fundamental condition to accomplish our scheme is to have
unlabelled data available; from now on we indicate with

unlabelled($m$) $\to \mathbf{x}$ with $|\mathbf{x}| = m$ (18)

the procedure providing us with $m \in \mathbb{N}$ unlabelled examples.

First we describe the procedure that for each value of $\lambda$ provides us with the
approximation of $\mathcal{N}(\lambda)$ we need for a fixed confidence level $1 - \eta$ and relative error
$0 < \delta < 1$. We let

r(A,  77,  A) 

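as in Theorem 2 and recall that

\[
\mathcal{N}_{\mathbf{x}}(\lambda) = \mathrm{Tr}((T_{\mathbf{x}} + \lambda I)^{-1} T_{\mathbf{x}}) = \mathrm{Tr}((\mathbf{K} + \lambda I)^{-1} \mathbf{K}),
\]

where $\mathbf{K}_{ij} = \frac{1}{m} K(x_i, x_j)$, the factor $1/m$ matching the normalization of $T_{\mathbf{x}}$.
We now first give the procedure eff_dim($\lambda$, $\eta$) and then briefly explain it.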


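eff_dim($\lambda$, $\eta$)

•  $j = 1$

•  do unlabelled($\Gamma(2^{-j}, 2^{-(j+1)}\eta, \lambda)$) $\to \mathbf{x}$; $j$ += 1

•  until $\mathcal{N}_{\mathbf{x}}(\lambda) > 2^{-j+1}$

•  unlabelled($\Gamma(2^{-j}\delta, 2^{-1}\eta, \lambda)$) $\to \mathbf{x}$

•  return $\mathcal{N}_{\mathbf{x}}(\lambda)$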


It is easy to show, applying Theorem 2, that with probability greater than $1 - \eta$,
$\hat{\mathcal{N}}$, the output of eff_dim($\lambda$, $\eta$), is bounded from above and from below in terms
of $\mathcal{N}(\lambda)$; formally we have

\[
P\big[ (1 - \delta) \mathcal{N}(\lambda) \le \hat{\mathcal{N}} \le (1 + \delta) \mathcal{N}(\lambda) \big] \ge 1 - \eta,
\]

where $\delta$ is the constant appearing in the text of eff_dim.
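
The two-stage idea behind eff_dim can be sketched in Python. This is a simplified
variant under stated assumptions, not a transcription of the procedure: a coarse loop
with absolute accuracy $2^{-j}$ stops once the estimate is at least twice the accuracy
(so that $\mathcal{N}(\lambda) \ge 2^{-j}$ on the good event), and a final draw with absolute
accuracy $\delta 2^{-j}$ then gives relative error $\delta$. The sample-size rule is passed
in as the callable sample_size; the rule toy_size used in the example is a hypothetical
stand-in and is not the bound of Theorem 2. All names, the linear kernel and the
uniform inputs are illustrative.

```python
import math
import numpy as np

def empirical_effective_dimension(gram, lam):
    """Tr[(T_x + lam)^{-1} T_x] from the kernel matrix gram[i, j] = K(x_i, x_j)."""
    m = gram.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(gram / m), 0.0, None)
    return float(np.sum(eigvals / (eigvals + lam)))

def eff_dim(lam, eta, delta, draw_unlabelled, kernel, sample_size, max_rounds=20):
    """Estimate N(lam) up to relative error delta (simplified variant)."""
    for j in range(1, max_rounds + 1):
        accuracy = 2.0 ** (-j)
        x = draw_unlabelled(sample_size(accuracy, eta * 2.0 ** (-(j + 1)), lam))
        rough = empirical_effective_dimension(kernel(x, x), lam)
        if rough >= 2.0 * accuracy:     # on the good event, N(lam) >= accuracy
            break
    else:
        raise RuntimeError("N(lam) is too small to resolve within max_rounds")
    # Absolute accuracy delta * accuracy <= delta * N(lam): a relative error delta.
    x = draw_unlabelled(sample_size(delta * accuracy, eta / 2.0, lam))
    return empirical_effective_dimension(kernel(x, x), lam)

# Toy setting: linear kernel, inputs uniform on [-1, 1]^3, so N(lam) = 3 / (1 + 3 lam).
rng = np.random.default_rng(0)
draw = lambda m: rng.uniform(-1.0, 1.0, size=(m, 3))
linear_kernel = lambda a, b: a @ b.T
# Hypothetical stand-in for the sample size of Theorem 2 (NOT the paper's formula).
toy_size = lambda acc, conf, lam: min(2000, math.ceil(4.0 * math.log(2.0 / conf) / acc ** 2))
lam = 0.5
estimate = eff_dim(lam, eta=0.1, delta=0.5, draw_unlabelled=draw,
                   kernel=linear_kernel, sample_size=toy_size)
```

The confidence levels $2^{-(j+1)}\eta$ of the coarse rounds sum to at most $\eta/2$, which
together with the final round at level $\eta/2$ gives overall confidence $1 - \eta$ by a
union bound.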

eff_dim is called by the procedure mod_sel given below. mod_sel($\ell$, $\eta$) returns
the integer $k(\ell)$ fulfilling with confidence level $1 - \eta$ Assumption 2 used in the
previous section. The idea behind the procedure is simply exploring the grid $(\lambda_k)_k$
until a crossing between the approximation term $\ell\lambda^{c}$ and our estimate of
$\mathcal{N}(\lambda)$ is encountered. Clearly this strategy is performed while properly controlling
the accuracy of the estimate of $\mathcal{N}(\lambda)$ and its confidence level.


mod_sel($\ell$, $\eta$)


•  $k, j = 1$

•  $\sigma$ = sign(eff_dim($\lambda_k$, $2^{-1}\eta(\ell)$) $- \ell(\lambda_k)^{c}$)

•  do $k = k + \sigma$; $j$ += 1

•  until $\sigma$(eff_dim($\lambda_k$, $2^{-j}\eta(\ell)$) $- \ell(\lambda_k)^{c}$) $> 0$

•  return  k  + 
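
The grid exploration can be sketched as follows. This is a simplified variant in the
spirit of mod_sel, not a transcription of it: it takes any estimator of
$\mathcal{N}(\lambda)$ as a callable, walks down a decreasing grid and returns the last index
at which $\ell\lambda_k^{c}$ still dominates the estimate, ignoring the confidence-level
bookkeeping. The geometric grid, the model spectrum $t_n = n^{-2}$ used as a stand-in
for the estimate and all names are illustrative assumptions.

```python
def effective_dimension(lam, b=2.0, n_terms=10**4):
    """Truncated sum of t_n / (t_n + lam) for the model spectrum t_n = n**(-b)."""
    return sum(n ** (-b) / (n ** (-b) + lam) for n in range(1, n_terms + 1))

def select_index(ell, c, grid, estimate):
    """Last index k of the decreasing grid with ell * grid[k]**c >= estimate(grid[k])."""
    if ell * grid[0] ** c < estimate(grid[0]):
        raise ValueError("grid[0] lies below the crossing; start from a larger value")
    k = 0
    while k + 1 < len(grid) and ell * grid[k + 1] ** c >= estimate(grid[k + 1]):
        k += 1
    return k

ell, c, q = 1000, 2.0, 0.8
grid = [q ** k for k in range(60)]      # lambda_k = q**k, so lambda_k >= q * lambda_{k-1}
k_sel = select_index(ell, c, grid, effective_dimension)
lam_plus, lam_minus = grid[k_sel], grid[k_sel + 1]
```

The two adjacent grid values lam_plus and lam_minus bracket the crossing between
$\ell\lambda^{c}$ and the estimate, and they differ by the factor $q$ of the grid.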